Classification of Buddhist Verses

Abstract

This study assesses the ability of machine learning to classify verses from Buddhist texts into two categories: Therigatha and Theragatha, attributed to female and male authors, respectively. It highlights the difficulties in data preprocessing and in applying Transformer-based models to Devanagari script due to limited vocabulary, demonstrating that simple statistical models can be equally effective. The research suggests areas for future exploration, provides the dataset for further study, and acknowledges existing limitations and challenges.

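Research objective: assess how well machine learning can classify Buddhist verses into two classes, Therigatha (female authors) and Theragatha (male authors).
Key findings:
- Traditional statistical models (SVC, naive Bayes) reach an AUC of 0.88-0.89, outperforming all Transformer models.
- On Devanagari script, Transformer performance drops markedly (AUC 0.76) because information is lost during tokenization.
- The vocabularies of the two collections overlap by only 10%, so traditional models can classify effectively from class-specific vocabulary alone.
Data release: a dataset of 1793 preprocessed verses is provided (GitHub: neveditsin/pali).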
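
The overlap figure above can be made concrete with a small sketch. How the 10% was measured is not stated here, so the Jaccard ratio used below (shared word types over all word types) is an assumption, as are whitespace tokenization, the function names, and the toy strings standing in for verses.

```python
def vocabulary(verses):
    """Collect the set of word types from whitespace-tokenized verses."""
    return {token for verse in verses for token in verse.split()}


def vocabulary_overlap(verses_a, verses_b):
    """Jaccard overlap: shared word types divided by all word types."""
    vocab_a, vocab_b = vocabulary(verses_a), vocabulary(verses_b)
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0


def class_specific(verses_a, verses_b):
    """Word types that occur in the first collection only."""
    return vocabulary(verses_a) - vocabulary(verses_b)


# Toy placeholder strings, not real verses from the dataset.
thera = ["a b c", "c d"]
theri = ["d e", "e f g"]

print(vocabulary_overlap(thera, theri))      # one shared type out of seven
print(sorted(class_specific(thera, theri)))  # types unique to the first set
```

When the overlap is this low, most word types point to a single class, which is consistent with bag-of-words models doing well on the task.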

Introduction

Background

Text characteristics:
- A gatha is a verse form in couplets; the earliest surviving record is found in the Avesta (224-651 CE).
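- Pali background: the texts studied (Theragatha and Therigatha) are written in Pali, a language mixed with the Prakrit vernaculars of the Buddha's time, which reflects the linguistic character of early Buddhist transmission.
Authorship disputes:
- The attribution of 32% of Therigatha verses is disputed (Findly, 1999).
- Thematic differences: the Therigatha emphasizes overcoming suffering and social constraints (Blackstone, 2013).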

Dataset and Preprocessing

1. Data source and structure

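Canonical source:
The data come from the Theragatha (Verses of the Elder Monks) and the Therigatha (Verses of the Elder Nuns), both in the Khuddaka-nikaya of the Pali Tipitaka, which belongs to the Sutta-pitaka.
Chapter division:
- Chapters are organized by the number of verses per author: for example, "Ekaka-nipaat" (the collection of ones) holds authors with a single verse, and "Dukanipaat" (the collection of twos) holds authors with two verses.
- The Theragatha has 21 chapters and 1288 verses; the Therigatha has 16 chapters and 524 verses. All verses are numbered in chapter order.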
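
The chapter scheme described above amounts to grouping authors by how many verses each contributed. The following is only an illustrative sketch: the function name, the author labels, and the placeholder verses are hypothetical and do not come from the dataset.

```python
from collections import defaultdict


def group_by_verse_count(verses_per_author):
    """Map each verse count to the sorted list of authors with that count."""
    chapters = defaultdict(list)
    for author, verses in verses_per_author.items():
        chapters[len(verses)].append(author)
    return {count: sorted(authors) for count, authors in sorted(chapters.items())}


# Hypothetical authors with placeholder verses.
sample = {
    "author_a": ["v1"],
    "author_b": ["v2", "v3"],
    "author_c": ["v4"],
}

# Key 1 plays the role of the "ones" chapter, key 2 the "twos" chapter.
print(group_by_verse_count(sample))
```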

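2. Data preprocessing

Core challenges:
- Punctuation normalization: punctuation marks are ambiguous across scripts (Devanagari and romanized), so uniform handling rules are needed.
- Text completion (Peyaala handling):
  When "pe" occurs (an abbreviation marker meaning that earlier content is repeated), the text has to be completed manually by matching the context, because no automated tool is available.
- Tokenization difficulty:
  Pali compound words (e.g. "muni" and "munin") have flexible segmentation rules, so the original forms are kept to avoid losing meaning.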
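
The first two challenges can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes verse ends are marked by the Devanagari danda (U+0964) or double danda (U+0965), or by ASCII bars in romanized text, and that a romanized "pe" stands alone as a word. The function names are hypothetical.

```python
import re

# Assumed verse-end marks: double danda, danda, and ASCII "||" or "|".
VERSE_END = re.compile("\u0965|\u0964|" + r"\|\||\|")
# Assumed form of the abbreviation marker in romanized text: a bare "pe".
PEYAALA = re.compile(r"\bpe\b")


def normalize_punctuation(text):
    """Replace every verse-end mark with ' . ' and collapse whitespace."""
    text = VERSE_END.sub(" . ", text)
    return " ".join(text.split())


def needs_manual_completion(romanized_text):
    """Flag romanized verses that contain the 'pe' marker."""
    return bool(PEYAALA.search(romanized_text))


# Placeholder tokens, not real Pali text.
print(normalize_punctuation("abc \u0964 def \u0965"))
print(normalize_punctuation("abc | def ||"))
print(needs_manual_completion("abc ...pe... def"))
```

Detection is kept separate from completion because, as noted above, the missing text still has to be filled in by hand.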

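3. Data statistics and cleaning

Raw data size:
Theragatha (1288 verses) vs. Therigatha (524 verses), 1812 verses in total.
Cleaning rules:
19 verses whose tokenization ambiguity could not be resolved are excluded (3 from the Therigatha, 16 from the Theragatha), leaving 1793 valid verses.
Vocabulary distribution: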
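
As a quick consistency check on the counts above, the arithmetic can be reproduced directly. The variable names are illustrative; the per-class totals after cleaning are derived from the stated figures rather than quoted from the source.

```python
# Verse counts before cleaning and the number excluded per collection.
raw = {"Theragatha": 1288, "Therigatha": 524}
excluded = {"Theragatha": 16, "Therigatha": 3}

kept = {name: raw[name] - excluded[name] for name in raw}
total_raw = sum(raw.values())
total_kept = sum(kept.values())

print(total_raw, total_kept)  # 1812 1793
print(kept)                   # {'Theragatha': 1272, 'Therigatha': 521}

# Share of the minority class after cleaning, rounded to three decimals.
print(round(kept["Therigatha"] / total_kept, 3))  # 0.291
```

The Therigatha makes up roughly 29% of the retained verses, so the two classes are imbalanced.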